Abstract

We consider nonparametric functional regression when both predictors
and responses are functions. More specifically, we let $(X_1,Y_1), \ldots, (X_n,Y_n)$
be random elements in $\mathcal{F} \times \mathcal{H}$, where $\mathcal{F}$ is a semi-metric space and $\mathcal{H}$ is
a separable Hilbert space. Based on a recently introduced notion of weak
dependence for functional data, we show almost sure convergence rates
for both the Nadaraya-Watson estimator and the nearest neighbor estimator
in a unified manner. Several factors, including the functional nature of the
responses, the assumptions on the functional variables in terms of the Orlicz norm,
and the desired generality for weakly dependent data, make the theoretical
investigation more challenging and interesting.
1 Introduction

The problem of regression with functional predictors has received increasing
interest, driven by the growing number of datasets whose observations can
be naturally perceived as curves. This trend started with the popular monograph
[24], which gives a detailed exposition of functional linear models. The existing
literature contains numerous theoretical and empirical studies on functional linear
models [sl, [isi 21, 29, 0, [isl, H, [oj. Nonparametric methods with functional
predictors and scalar responses appeared later 0,1231, 3, and are by now
widely accepted by the statistical community as a more flexible approach
to functional regression that imposes fewer structural assumptions. As this area
develops and matures, the situation where the responses are also curves
has begun to receive more attention [l], 0, [13]. For example, one might predict
annual precipitation using temperature measurements as in^5[, or predict future
hourly electricity consumption based on past history as in j2[. Although these two
studies follow the parametric approach to functional regression, it is clear that
the nonparametric approach is a viable alternative [20].

On the other hand, the assumption of independence made in most theoretical
investigations carried out so far is too restrictive in many applications. The need
to account properly for data dependence is clearly demonstrated by the example
given in [10], where a functional observation denotes the monthly electricity
consumption over a year; it is unrealistic to assume that electricity consumption
in one year is independent of that of the previous year. In previous studies of
nonparametric functional regression, dependence is incorporated through
mixing conditions [12]. Here we instead use the notion of $L^p$-$m$-approximability
advocated in [16], [l^ (with some appropriate minor extensions). The advantage
over mixing conditions is that the $L^p$-$m$-approximability condition
is easily verified in many examples, as shown in [16].

In the more classical setting, the observation pairs reside in Euclidean
spaces. In this paper, we carry out a theoretical investigation of nonparametric
functional regression with functional responses on dependent data. Two related
classes of nonparametric estimates have been proposed: the k-nearest neighbor
estimate (k-NN) and the Nadaraya-Watson kernel estimate. Because of their
similarity in many aspects, we unify the proofs for these two as much as
possible. We show almost sure convergence of these nonparametric estimators
under assumptions on the Orlicz norms of the functional variables. Due to the
functional nature of the responses and the assumption of weak dependence, the
theoretical investigation poses serious challenges, and a novel construction of
a martingale difference sequence is introduced. Finally, we note that throughout
the paper we use C to denote a generic constant that may assume different values
at different places.
2 Almost sure convergence of nonparametric estimates

2.1 On the notion of Orlicz norm and weak dependence

In this subsection we review the concept of the Orlicz norm and collect some of its
simple properties in a lemma for easy reference later. All of the
properties are simple and most are well known, but some (such
as Lemma 1 (vi) and (vii)) seem to be new; we could not find them in the existing literature. We also
review and extend the notion of $L^p$-$m$-approximability of a data sequence, using
the more general Orlicz norm instead of the $L^p$ norm.



Following [28], let $\psi$ be a convex, increasing function on $[0, \infty)$ with $\psi(0) = 0$,
and let $X$ be a real-valued random variable. The Orlicz norm (or $\psi$-Orlicz norm,
to emphasize its dependence on $\psi$) is defined as

$$\|X\|_\psi = \inf\{C > 0 : E[\psi(\|X\|/C)] \le 1\},$$

which can be shown to be indeed a norm. For random elements $X$ taking values in
a normed space, the Orlicz norm of $\|X\|$ (which is a real-valued random variable)
is also denoted by $\|X\|_\psi$ for simplicity.

There are two commonly used $\psi$ functions: $\psi(x) = x^p$ and $\psi(x) = \exp\{x^p\} - 1$,
$p \ge 1$, and throughout the paper we use $\psi_p$ to denote the latter. With $\psi(x) = x^p$,
the Orlicz norm is simply the $L^p$ norm $(E|X|^p)^{1/p}$. With $\psi(x) = \psi_p(x) = \exp\{x^p\} - 1$,
the finiteness of the Orlicz norm of $X$ is closely related to the exponential decay
of its tail probability; the exact statement is contained in the following
lemma, together with other simple properties of the Orlicz norm.
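As an illustration of the definition, the following minimal sketch computes the empirical counterpart of the Orlicz norm of a finite sample, replacing the expectation by a sample mean and locating the infimum by bisection. The function names (`orlicz_norm`, `psi_power`, `psi_exp`) and the tolerance are our own hypothetical choices, not part of the original development.

```python
import math

def orlicz_norm(sample, psi, tol=1e-10):
    """Empirical Orlicz norm: inf{C > 0 : mean(psi(|x|/C)) <= 1}, found by bisection."""
    xs = [abs(x) for x in sample]
    if max(xs) == 0.0:
        return 0.0

    def excess(c):
        # mean of psi(|x|/c) minus 1; non-increasing in c because psi is increasing
        try:
            return sum(psi(x / c) for x in xs) / len(xs) - 1.0
        except OverflowError:
            return math.inf

    lo, hi = 0.0, max(xs)
    while excess(hi) > 0.0:       # enlarge hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol * hi:     # bisection for the smallest feasible C
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return hi

def psi_power(p):
    """psi(x) = x^p, which gives the L^p norm."""
    return lambda x: x ** p

def psi_exp(p):
    """psi_p(x) = exp{x^p} - 1."""
    return lambda x: math.exp(x ** p) - 1.0
```

With `psi_power(2)` the result agrees with the root mean square of the sample, and with `psi_exp(1)` it is never smaller than the sample mean of $|x|$, in line with the fact that $\psi_1(x) \ge x$.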

Lemma 1 Below we assume $\psi$ is a valid function that defines an Orlicz norm,
that is, $\psi$ is convex and increasing on $[0, \infty)$ with $\psi(0) = 0$, and $X$ is a random variable.

(i) $P(|X| > x) \le 1/\psi(x/\|X\|_\psi)$, for all $x > 0$.

(ii) If $P(|X| > x) \le K\exp\{-Cx^p\}$ for all $x > 0$ and some constants $K$ and $C$,
then $\|X\|_{\psi_p} \le ((1 + K)/C)^{1/p}$.

(iii) If $\tilde\psi(x) = \psi(ax)$ for some $a > 0$, then $\|X\|_{\tilde\psi} = a\|X\|_\psi$.

(iv) If $\tilde\psi(x) \le a\psi(x)$ for some $a \ge 1$, then $\|X\|_{\tilde\psi} \le a\|X\|_\psi$.

(v) If $\tilde\psi(x) = \phi(\psi(ax))$ for some $a > 0$ and some concave increasing function $\phi$
with $\phi(0) = 0$ and $\phi(1) = 1$, then $\|X\|_{\tilde\psi} \le a\|X\|_\psi$.

(vi) If $\tilde\psi(x) := \psi(x^{1/p})$, $p \ge 1$, is convex, then $\| |X|^p \|_{\tilde\psi} \le \|X\|_\psi^p$.

(vii) $\|E[X|\mathcal{G}]\|_\psi \le \|X\|_\psi$, for any $\sigma$-algebra $\mathcal{G}$.

Proof. Results (i) and (ii) can be found in Section 2.2 of jisj. (iii) is obvious
from the definition of the Orlicz norm. To prove (iv), we note that
$E\tilde\psi(|X|/(a\|X\|_\psi)) \le aE\psi(|X|/(a\|X\|_\psi)) \le E\psi(|X|/\|X\|_\psi) \le 1$, where we used that $\psi(x/a) \le \psi(x)/a$
due to the convexity of $\psi$. For (v), since
$E\tilde\psi(|X|/(a\|X\|_\psi)) = E\phi(\psi(|X|/\|X\|_\psi)) \le \phi(E\psi(|X|/\|X\|_\psi)) \le \phi(1) = 1$
(using Jensen's inequality), we get $\|X\|_{\tilde\psi} \le a\|X\|_\psi$ by definition.
For (vi), the result follows from $E\tilde\psi(|X|^p/\|X\|_\psi^p) = E\psi(|X|/\|X\|_\psi) \le 1$.
Finally, (vii) follows from
$E\psi(E[X|\mathcal{G}]/\|X\|_\psi) = E\psi(E[X/\|X\|_\psi \,|\, \mathcal{G}]) \le E(E(\psi(X/\|X\|_\psi)|\mathcal{G})) = E\psi(X/\|X\|_\psi) \le 1$,
where we used $\psi(E[X/\|X\|_\psi \,|\, \mathcal{G}]) \le E[\psi(X/\|X\|_\psi)|\mathcal{G}]$ due to the
convexity of $\psi$. □

We already noted that the $L^p$ norm is a special case of the Orlicz norm when $\psi(x) = x^p$.
On the other hand, based on Lemma 1 (v), one can show that $\|X\|_p \le C\|X\|_{\psi_q}$
for any $p, q \ge 1$ and $\|X\|_{\psi_{q_1}} \le C'\|X\|_{\psi_{q_2}}$ if $q_1 < q_2$ (where $C, C'$ are
constants that depend only on $p, q, q_1, q_2$). In this sense the norm $\|\cdot\|_{\psi_q}$ is stronger
than $L^p$, and the more so with larger $q$.

As explained in the introduction, for data collected sequentially over time, the
assumption of independence is not realistic. In [16], the authors formalize the
notion of dependence for functional data using $L^p$-$m$-approximability. Instead
of the $L^p$ norm, which is sufficient for the purposes of those studies, we
use the Orlicz norm here.

Definition 1 Given a function $\psi$ that defines an Orlicz norm, a sequence $\{X_i\}_{i=1}^{\infty}$
(taking values in a normed space) with finite Orlicz norm is said to be $\psi$-$m$-approximable
if we have the representation

$$X_i = h(\alpha_i, \alpha_{i-1}, \ldots),$$

where the $\alpha_k$ are independent and identically distributed random elements of a
measurable space and $h$ is a measurable function. In addition, we assume that if

$$X_i^{(m)} = h(\alpha_i, \alpha_{i-1}, \ldots, \alpha_{i-m+1}, \alpha'_{i-m}, \alpha'_{i-m-1}, \ldots),$$

with $\alpha'_k$ independent copies of $\alpha_k$, then

$$\sum_{m=1}^{\infty} \|X_m - X_m^{(m)}\|_\psi < \infty.$$

For a $\psi$-$m$-approximable sequence $\{X_i\}$, we say it is $\psi$-$m$-approximable
with decay rate $\{\gamma_k\}$ if $\sum_{m=k}^{\infty} \|X_m - X_m^{(m)}\|_\psi = O(\gamma_k)$.
In [16], several examples of $L^p$-$m$-approximable sequences are given; minor
modifications of these produce more general $\psi$-$m$-approximable sequences.
For example, a functional autoregressive process (Example 2.1 in [16]) is
$\psi$-$m$-approximable as long as the innovation noise has finite $\psi$-Orlicz norm, by the
same arguments. Although not explicitly stated there, a functional autoregressive
process is $\psi$-$m$-approximable with exponential decay rate: $\gamma_m = O(\exp\{-Cm\})$
for some constant $C$.
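To make the coupling in Definition 1 concrete, the sketch below uses a scalar AR(1) process as a simplified stand-in for the functional autoregressive case; this simplification, the truncation of the moving-average representation at `n_lags` terms, the Gaussian innovations and the names `ar1_pair` and `l2_gap` are all our own assumptions for illustration. The coupled variable $X_i^{(m)}$ shares the $m$ most recent innovations with $X_i$ and uses independent copies for the older ones.

```python
import random

def ar1_pair(rho, m, n_lags=200, seed=0):
    """Return (X_i, X_i^(m)) for a scalar AR(1) X_i = sum_j rho^j * alpha_{i-j},
    truncated at n_lags terms. X_i^(m) keeps the m most recent innovations and
    replaces the older ones by independent copies."""
    rng = random.Random(seed)
    alpha = [rng.gauss(0.0, 1.0) for _ in range(n_lags)]       # alpha_i, alpha_{i-1}, ...
    alpha_copy = [rng.gauss(0.0, 1.0) for _ in range(n_lags)]  # independent copies
    x = sum(rho ** j * alpha[j] for j in range(n_lags))
    x_m = sum(rho ** j * (alpha[j] if j < m else alpha_copy[j]) for j in range(n_lags))
    return x, x_m

def l2_gap(rho, m):
    """Exact L^2 norm of X_i - X_i^(m) for unit-variance innovations (no truncation)."""
    return (2.0 * rho ** (2 * m) / (1.0 - rho ** 2)) ** 0.5
```

In this toy case the gap `l2_gap(rho, m)` is proportional to `rho ** m`, so its tail sums decay geometrically in $m$, which is the exponential decay rate mentioned above, here in the $L^2$ norm rather than a general Orlicz norm.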



2.2 Nonparametric estimates and convergence rate 

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a stationary (in the strong sense) sequence of
$\mathcal{F} \times \mathcal{H}$-valued random elements with $E\|Y_i\| < \infty$, where $\mathcal{F}$ is a semi-metric space with
semi-metric $d(\cdot, \cdot)$ and $\mathcal{H}$ is a Hilbert space with norm $\|\cdot\|$. The regression function
is $r(x) = E(Y|X = x)$ and we can write $Y_i = r(X_i) + \epsilon_i$, where $\epsilon_i = Y_i - E(Y_i|X_i) \in \mathcal{H}$
are mean-zero noises (in the sense of the Bochner integral, see [19]). In this
subsection, we always consider probabilities and expectations conditional on $\{X_i\}$, in
effect treating it as fixed. The asymptotic results stated are thus conditional on the
predictors, even though we do not state this explicitly in the following. The
implications of random predictors are treated in the next subsection, after we present
the general convergence results in this subsection.

The regression function can be estimated by local weighting of the responses,

$$\hat r(x) = \sum_{i=1}^{n} W_{ni}(x) Y_i, \tag{1}$$

where $(W_{n1}(x), \ldots, W_{nn}(x))$ is a probability vector of weights. Note that $W_{ni}(x)$
can be a function of all $X_k$, $k = 1, \ldots, n$, instead of $X_i$ only, as is the case for
k-NN estimates (see the examples below). Since in this paper we only investigate
pointwise convergence at a fixed point $x$, we will use the notation $(W_{n1}, \ldots, W_{nn})$
in the following for simplicity.

We rank $(X_i, Y_i)$, $i = 1, \ldots, n$, by increasing value of $d(X_i, x)$ (ties are
broken by indices) and obtain a vector $(R_1, \ldots, R_n)$ such that $X_{R_i}$ is the $i$th nearest
neighbor of $x$. Letting $v_{ni} = W_{nR_i}$, we can write (1) equivalently as

$$\hat r(x) = \sum_{i=1}^{n} v_{ni} Y_{R_i}. \tag{2}$$

Our consideration of weak dependence leads to extra complications in the proofs.
If the observations are independent, then obviously the $Y_{R_i}$ are also independent. However,
if $(Y_1, Y_2, \ldots)$ is merely stationary, then $(Y_{R_1}, Y_{R_2}, \ldots)$ is no longer stationary
in general, since the order of the observations is broken. We will thus use
representation (1) in most parts of our proofs, although representation (2) is easier to
manipulate in the study of k-NN estimates for independent data.

Example 1. Simple nearest neighbor estimate. Take $v_{ni} = 1/k$ for $i \le k$ and
$v_{ni} = 0$ for $i > k$, so that the regression function estimate is just the average of the
responses corresponding to the $k$ nearest neighbors of $x$. Even in this simplest case,
although $v_{ni}$ is a deterministic sequence, $W_{ni}$ still depends on all $X_j$, $1 \le j \le n$,
since all predictors jointly determine the neighbors of $x$. More generally, we can take
$v_{ni}$ to be a deterministic sequence with $v_{n1} \ge v_{n2} \ge \cdots \ge v_{nn}$, thus putting more
weight on data closer to $x$.

Example 2. Nearest neighbor estimate based on kernel. Take
$W_{ni} = K(d(X_i, x)/H) / \sum_j K(d(X_j, x)/H)$, where $K$ is a kernel function and $H$ is
the distance of the $k$th nearest neighbor to $x$. Mathematically,

$$H = \inf\Big\{h \in R : \sum_{i=1}^{n} I\{X_i \in B(x, h)\} \ge k\Big\}, \tag{3}$$

where $B(x, h) = \{x' \in \mathcal{F} : d(x', x) \le h\}$ and $I\{\cdot\}$ denotes the indicator function.
In this subsection, since we condition on the predictors $\{X_i\}$, $H$ is a known fixed value.

Example 3. Nadaraya-Watson estimate. Take $W_{ni} = K(d(X_i, x)/H) / \sum_j K(d(X_j, x)/H)$,
which has exactly the same form as in the previous example. However, here $H$ is
a predetermined value, usually called the bandwidth parameter, not derived from
the distance of the $k$th nearest neighbor of $x$. Typically, one applies the same value of $H$ for
all values of $x$. Thus, compared to the nearest neighbor estimate, the Nadaraya-Watson
estimate is not adaptive to the local sparseness of the data. In this subsection, when
conditioning on the predictors and for a given $x$, the Nadaraya-Watson estimator
is of course the same as that in Example 2, since $H$ is fixed in both cases. The differences will
appear in the next subsection.
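The three examples can be sketched in a few lines of code. The version below assumes, purely for illustration, that curves are stored as lists of values on a common grid, that the semi-metric is the Euclidean distance between these vectors, and that $H > 0$; the function names and the uniform kernel are hypothetical choices of ours rather than part of the estimators' definition.

```python
import math

def l2_distance(u, v):
    """Hypothetical semi-metric: Euclidean distance between curves on a common grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_bandwidth(dists, k):
    """H as in (3): the distance from x to its k-th nearest neighbor."""
    return sorted(dists)[k - 1]

def uniform_kernel(u):
    """K(u) = 1 on [0, 1] and 0 elsewhere."""
    return 1.0 if 0.0 <= u <= 1.0 else 0.0

def kernel_weights(dists, H, kernel=uniform_kernel):
    """W_ni = K(d(X_i, x)/H) / sum_j K(d(X_j, x)/H); assumes H > 0."""
    raw = [kernel(d / H) for d in dists]
    total = sum(raw)
    return [w / total for w in raw]

def estimate(X, Y, x, H=None, k=None, kernel=uniform_kernel, dist=l2_distance):
    """r_hat(x) = sum_i W_ni Y_i. Pass k for the k-NN bandwidth, or H for Nadaraya-Watson."""
    dists = [dist(xi, x) for xi in X]
    if H is None:
        H = knn_bandwidth(dists, k)
    W = kernel_weights(dists, H, kernel)
    # responses are curves on a grid, so the weighted average is taken pointwise
    return [sum(w * y[t] for w, y in zip(W, Y)) for t in range(len(Y[0]))]
```

With the uniform kernel and the k-NN bandwidth (and no ties in the distances), `estimate` reduces to the simple average of Example 1; passing a fixed `H` instead gives the Nadaraya-Watson form of Example 3.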

Naturally we need the following assumption on the regression function to obtain 
nontrivial rates of convergence. 

Assumption 1: $r$ is bounded and Lipschitz continuous. That is, $\|r(x)\| \le B$
for all $x \in \mathcal{F}$ and $\|r(x) - r(x')\| \le M\|x - x'\|^{\alpha}$.

In fact, since we only consider pointwise convergence, it suffices that $r$ is
Lipschitz continuous on an open neighborhood of $x$. We nevertheless use the above
assumption for simplicity of the statements.

Assumption 2: We assume $v_{n1} \ge v_{n2} \ge \cdots \ge v_{nn}$. Moreover, for some integer $k$
with $k/n \to 0$ and $k/\log n \to \infty$, we have $\sum_{i=k+1}^{n} v_{ni} = O(b_n)$ and
$(\sum_{i=1}^{n} v_{ni}^2)^{1/2} = O(c_{n2})$ with $b_n, c_{n2} \to 0$. Also, we denote by $H$ the distance to $x$ from its $k$th
nearest neighbor, and we assume $H \to 0$.

Although Assumption 2 as stated is more amenable to k-NN estimates,
it can also be used for the Nadaraya-Watson estimate, as will become clear in the next
subsection. We also impose the following assumptions on the noise.

Assumption 3: Given a convex increasing function $\psi$ with $\psi(0) = 0$,
suppose that for some constant $C > 0$ and some concave increasing function $\phi$ with
$\phi(0) = 0$, $\phi(1) = 1$, we have $x^r \le \phi(\psi(Cx))$ for some $r > 2$. Moreover,
$M := \|\epsilon_i\|_\psi < \infty$ and the stationary sequence $(\epsilon_1, \epsilon_2, \ldots)$ is $\psi$-$m$-approximable
with decay rate $\{\gamma_m\}$.

In the above assumption, the Orlicz norm is used for bounding the tail probability
of the noise (Lemma 1 (i)) as well as for controlling the dependence. It is of course
possible to use different $\psi$ for these two purposes, but using the same
$\psi$ seems most natural since both concern the same noise. The assumption
$x^r \le \phi(\psi(Cx))$ deserves some explanation. By Lemma 1 (v), it implies that
the $r$-th moment of the noise variable is finite for some $r > 2$, and it is in
particular satisfied by $\psi(x) = x^p$ for $p \ge r$ and by $\psi(x) = \psi_q(x)$ for $q \ge 1$. When a
stronger $\psi$-Orlicz norm is used, Assumption 3 imposes a stronger constraint, but
the summability conditions in Theorem 1 below are easier to satisfy.

Our main result for functional nonparametric estimates with functional
responses is the following.

Theorem 1 If Assumptions 1, 2 and 3 hold and if one can find sequences $a_n \to 0$,
$L_n \to 0$, $x_n \to 0$, $m_n$ with $m_n$ an integer between 1 and $n$, such that (in the rest
of the paper these sequences are simply denoted by $a, L, x, m$)

(*) The four sequences, $\exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\}$ for some constant $C$ big
enough, $1/\psi(\sqrt{x/2}/(\gamma_1 c_{n2}))$, $(m/a)/\psi^{1-1/r}(L/(2Mmv_{n1}))$, and $1/\psi(a/(2nv_{n1}\gamma_m))$,
are all summable over $n$.

Then $\|\hat r(x) - r(x)\| = O(b_n + H^{\alpha} + a + (\gamma_1 v_{n1})^{1/2})$ almost surely.

Remark 1 Here we present a unified result for both the nearest-neighbor estimate and
the Nadaraya-Watson estimate. For the nearest-neighbor estimate, $k$ is pre-specified
and typically $b_n$ and $c_{n2}$ are explicit functions of $k$ and thus deterministic.
On the other hand, $H$ depends on $k$ through (3) and thus depends on the predictors.
The situation for the Nadaraya-Watson estimate is exactly the opposite: $H$ is
prespecified (typically as a function of the sample size) and $k$ is the number of
predictors falling into the ball with radius $H$, and thus depends on the data. Similarly,
$v_{ni}$, as order statistics of $W_{ni}$, depend on the predictor values.

Remark 2 Because of the requirement $\sum_n \exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\} < \infty$,
we see that the sequence $a$ cannot converge faster than $mc_{n2}$, and thus in the following
we will focus on cases where this rate is achievable up to some logarithmic terms.

Remark 3 For independent data, $\gamma_1 = 0$ and the term $(\gamma_1 v_{n1})^{1/2}$ does not appear.
More generally, this term can be ignored as long as $v_{n1} = O(c_{n2}^2)$; see Remark 2
above. As an example, we obviously have $v_{n1} = c_{n2}^2 = 1/k$ for the simplest k-NN
estimate with $v_{ni} = 1/k$, $i \le k$. In the next subsection, one will see that for the
Nadaraya-Watson estimate in Example 3 above, $v_{n1}$ and $c_{n2}^2$ are also
of the same order under mild assumptions.

Remark 4 In the convergence rate, $b_n$ and $H^{\alpha}$ represent the bias while $a$ comes
from the variance of the estimator. As the theorem is stated with generality
rather than clarity in mind, it is hard to see what the convergence rate is in typical
situations, and thus we discuss the rates in some special cases in the rest of this
subsection.

Independent case When the data are independent, $1/\psi(\sqrt{x/2}/(\gamma_1 c_{n2}))$ and
$1/\psi(a/(2nv_{n1}\gamma_m))$ are zero (informally, $\gamma_m = 0$ when the data are independent and
we take $\psi(\infty) = \infty$; more rigorously, it can be seen from the proofs that these
two terms are zero), and we can take $m = 1$, $x = 0$. Taking $L = c_{n2}$ and
$a = (\log n)c_{n2}$, the first sequence in (*) is then obviously summable. So as long as
$1/(a\psi^{1-1/r}(c_{n2}/(2Mv_{n1})))$ is summable, we have convergence rate $(\log n)c_{n2}$. For
the simplest nearest neighbor estimate with $v_{ni} = 1/k$, $i \le k$, we have $c_{n2} = 1/\sqrt{k}$.
The expression $1/(a\psi^{1-1/r}(c_{n2}/(2Mv_{n1})))$ simplifies to $\sqrt{k}/((\log n)\psi^{1-1/r}(\sqrt{k}/(2M)))$.
For $\psi(x) = x^p$ or $\psi(x) = \exp\{x^p\} - 1$, this is obviously a restriction on $k$, namely
that $k$ should diverge fast enough. We note that, by existing
results on the k-NN estimate for independent data with scalar responses, the
variance term is expected to be $c_{n2} = 1/\sqrt{k}$, which agrees with the rate here up to a
logarithmic term. In summary, we have

Corollary 1 For the simplest k-NN estimate with $v_{ni} = 1/k$, $i \le k$, if
$\sum_{n=1}^{\infty} \sqrt{k}/\psi^{1-1/r}(\sqrt{k}/(2M)) < \infty$, where $M = \|\epsilon_i\|_\psi$, then
$\|\hat r(x) - r(x)\| = O(H^{\alpha} + (\log n)/\sqrt{k})$ almost surely.

We note that for the Nadaraya-Watson estimate in Example 3, the discussion in the next
subsection suggests that the convergence behavior is very much the same under
reasonable assumptions.

Weakly dependent case Here the convergence rate is determined by the
interplay of $\psi$ and $\{\gamma_m\}$ in a more complicated way. For example, qualitatively,
the summability of $1/\psi(a/(2nv_{n1}\gamma_m))$ is easier to satisfy the smaller $\gamma_m$ is
(weaker dependence). Moreover, the choice of $x$ must take into account the trade-off
between the summability of $\exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\}$ and the summability
of $1/\psi(\sqrt{x/2}/(\gamma_1 c_{n2}))$ (the former is an increasing function of $x$ while the
latter is a decreasing function of $x$). Similarly, the choice of $m$ must take into
account the trade-off between the summability of $(m/a)/\psi^{1-1/r}(L/(2Mmv_{n1}))$ and that of
$1/\psi(a/(2nv_{n1}\gamma_m))$ (the former is an increasing function of $m$ while the latter is
typically a decreasing function of $m$). Ignoring the complication of choosing $m$,
when $\psi(x) = \psi_p(x) = \exp\{x^p\} - 1$, the following corollary gives one
situation where it is possible to set $a = mc_{n2}$ up to an extra logarithmic term.

Corollary 2 When $\psi = \psi_p$, $p \ge 1$, we have convergence rate
$\|\hat r(x) - r(x)\| = O(b_n + H^{\alpha} + (\log n)^2 mc_{n2} + (\gamma_1 v_{n1})^{1/2})$ as long as
$1/\psi(C(\log n)^2 m/(n\gamma_m))$ is summable for $C$ large enough.

Proof. Take $x = C(\log n)^2c_{n2}^2$ ($C$ large enough) and $L = C(\log n)mc_{n2}$; the first
sequence in (*) is then summable if $a = C(\log n)^2mc_{n2}$. Moreover,
$1/\psi(\sqrt{x/2}/(\gamma_1 c_{n2})) \le 1/\psi(C\log n)$ is summable. Using the trivial inequalities
$c_{n2} \ge v_{n1}$ and $v_{n1} \ge 1/n$, we get $m/a \le n$ and thus
$(m/a)/\psi^{1-1/r}(L/(2Mmv_{n1})) \le n/\psi^{1-1/r}(C\log n)$ is
summable. Finally, for the last sequence in (*), we have

$$\sum_n 1/\psi(a/(2nv_{n1}\gamma_m)) \le \sum_n 1/\psi(C(\log n)^2m/(n\gamma_m)) < \infty,$$

by the assumption in the statement of this corollary. □

Finally, we note that in the above corollary, if $\gamma_m = e^{-Cm}$ for some $C > 0$, then
we can take $m \sim \log n$ so that all sequences in (*) are summable, and the rate of
convergence is $(\log n)^3c_{n2}$.

2.3 On the properties of H and k with random covariates 

In the previous subsection, we treated the predictors as fixed, and the convergence
rate depends on the sequence $\{X_i\}$. Here we study the behavior of some of the
quantities that appeared in the rates when $\{X_i\}$ is a random stationary sequence
in typical situations. Results obtained in this subsection can be combined with
Theorem 1 to obtain more explicit convergence rates. The necessity of studying
$H$ (for the NN estimator) or $k$ (for the Nadaraya-Watson estimator) is seen from Remark
1 in the previous subsection.

When the $X_i$ are random, we will make use of the important quantity
$\varphi(h) := P(\{x' : x' \in B(x, h)\})$, which is called the small ball probability. Its importance has
been demonstrated in [l^ for functional kernel regression with scalar responses. In
particular, the use of $\varphi(h)$ in a functional setting replaces the common assumption
on the existence of a density for $X$ when $X$ belongs to some Euclidean space. It
is easy to see that in the classical setting, with mild assumptions on the density of
$X \in R^d$, we have $\varphi(h) \sim h^d$.

Nearest neighbor estimate. We only consider the simplest k-NN estimate
as in Example 1, with $v_{ni} = 1/k$, $i = 1, \ldots, k$. Then, in the convergence rates,
$b_n = 0$, $c_{n2}^2 = \sum_i v_{ni}^2 = 1/k$, and $v_{n1} = \max_i W_{ni} = 1/k$. Thus only the quantity $H$
depends on $\{X_i\}$. If the sequence $\{X_i\}$ contains independent elements, one can
show $H = O(\varphi^{-1}(2k/n))$ almost surely, as in the following proposition.

Proposition 1 Suppose $k/n \to 0$ and $k/\log n \to \infty$. Let $H$ be the distance from
$x$ to its $k$-th nearest neighbor as defined in (3). Then $P(H > \varphi^{-1}(2k/n), \text{ i.o.}) = 0$,
where i.o. means "infinitely often" and $\varphi^{-1}(x) := \inf\{h : \varphi(h) \ge x\}$.

Proof. First we note that $\varphi$ is right-continuous and non-decreasing, and thus
$h = \varphi^{-1}(x)$ implies $\varphi(h) \ge x$.
Denote $c = \varphi^{-1}(2k/n)$ and $p = \varphi(c)$, so that $np \ge 2k$. We have

$$\begin{aligned}
P(H > \varphi^{-1}(2k/n))
&= P\Big(\sum_i I\{X_i \in B(x, c)\} < k\Big) \\
&= P\Big(\sum_i I\{X_i \in B(x, c)\} - np < k - np\Big) \\
&\le P\Big(\Big|\sum_i I\{X_i \in B(x, c)\} - np\Big| > np/2\Big) \\
&\le 2\exp\{-\tfrac{1}{2}(np/2)^2/[np(1 - p) + (np/6)]\} \\
&\le 2\exp\{-Cnp\},
\end{aligned}$$

where we applied Bernstein's inequality for Bernoulli random variables (see for
example Appendix B in jiij). Then $P(H > \varphi^{-1}(2k/n), \text{ i.o.}) = 0$ can be shown
using the Borel-Cantelli lemma, noting that $k/\log n \to \infty$. □
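The key step of the proof can be checked numerically in the independent case. The sketch below compares the exact probability that fewer than $k$ of $n$ independent predictors fall in a ball of small ball probability $p$ with the Bernstein bound used above; the function names are hypothetical, and the comparison is only meaningful under the proof's condition $np \ge 2k$.

```python
import math

def prob_fewer_than_k(n, p, k):
    """Exact P(Bin(n, p) < k), i.e. P(H > c) when p = P(X in B(x, c))."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k))

def bernstein_bound(n, p):
    """2 exp{-(1/2)(np/2)^2 / [np(1-p) + np/6]}, the bound used in the proof."""
    t = n * p / 2.0
    return 2.0 * math.exp(-0.5 * t ** 2 / (n * p * (1.0 - p) + t / 3.0))
```

For instance, with `n = 200`, `k = 10` and `p = 2 * k / n`, the exact probability returned by `prob_fewer_than_k` lies below `bernstein_bound(n, p)`, and the bound shrinks as `n * p` grows, which is what drives the Borel-Cantelli argument.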

In [l^, the authors distinguished two types of processes: the fractal type
processes and the exponential type processes. The former is characterized by
$\varphi(h) \sim h^{\tau}$ for some $\tau > 0$, and the latter by
$\varphi(h) \sim \exp\{-(1/h^{\tau_1})\log(1/h^{\tau_2})\}$, $\tau_1 > 0$, $\tau_2 > 0$.
The fractal type processes are similar to finite dimensional covariates
in many aspects, while in the infinite dimensional case, such as when the covariate
curves belong to some smoothness class, exponential type processes are more
typical. For example, Brownian motion is of exponential type. The paper [27]
provides other, more complicated Gaussian processes, all of which are of exponential
type. Combining Proposition 1 above with Corollary 1, we obtain the rate
$O([\varphi^{-1}(2k/n)]^{\alpha} + (\log n)/\sqrt{k})$ for independent data. When the optimal $k$ is chosen,
it is easy to see that for exponential type processes the convergence rates are
logarithmic in the sample size, much slower than in the classical finite-dimensional case.
Also note that this slow rate is largely determined by the term $[\varphi^{-1}(2k/n)]^{\alpha}$, which
converges to zero logarithmically whether $k$ increases logarithmically or polynomially
in $n$.
For a weakly dependent sequence $\{X_i\}$, in particular assuming $\{X_i\}$ is
$\psi$-$m$-approximable with $\|d(X_i, X_i^{(m)})\|_\psi = \beta_m$, $\sum_{m=1}^{\infty} \beta_m < \infty$ (a minor extension
to Definition 1 is needed here since $X_i \in \mathcal{F}$, which is not a normed space; thus we
need to use $d(\cdot, \cdot)$ instead of $\|X_i - X_i^{(m)}\|$), we can show the following proposition,
whose proof is deferred to the next section. Note that although we use the same
notation as before, $\psi$ here is different from that in Assumption 3, since here we
are considering the predictor sequence instead of the noise sequence.

Proposition 2 Suppose that for some $h > \varphi^{-1}(2k/n)$, there exists some sequence
$1 \le m \le n$ such that $k/n \to 0$, $k/(m\log n) \to \infty$ and
$\sum_{n=1}^{\infty} n/\psi((h - \varphi^{-1}(2k/n))/\beta_m) < \infty$. Then we have $H \le h$ for $n$ large enough, almost surely.

Nadaraya-Watson estimate. Here $W_{ni} = K(d(X_i, x)/H) / \sum_j K(d(X_j, x)/H)$
and we only consider the simple case where the kernel function $K$ satisfies
$cI_{[0,1]} \le K \le CI_{[0,1]}$ for some $C \ge c > 0$. Unlike the k-NN estimate, here $H$ is predetermined.

In Assumption 2, we let $k$ be the number of covariates inside the ball $B(x, H)$,
and thus if $X_i$ is not one of the $k$ nearest neighbors of $x$, we have $W_{ni} = 0$,
and thus $b_n = \sum_{i=k+1}^{n} v_{ni} = 0$ in the convergence rate in Theorem 1. Since $H$ is
predetermined in Nadaraya-Watson estimates, the only quantities in the convergence
rates that depend on $\{X_i\}$ are $v_{n1} = \max_i W_{ni}$ and $c_{n2} = (\sum_i v_{ni}^2)^{1/2}$. Since
$v_{n1} \le C / \sum_j K(d(X_j, x)/H) \le C/(ck)$, $v_{n1} \ge c/(Ck)$ as long as $k \ge 1$, and $c_{n2} \sim 1/\sqrt{k}$,
which can be easily shown, we only need to study the asymptotic behavior of $k$,
the number of predictors inside the ball $B(x, H)$.
With $\{X_i\}$ an independent sequence, we have

Proposition 3 Suppose $H \to 0$ and $n\varphi(H)/\log n \to \infty$. Then $n\varphi(H)/2 \le k \le
2n\varphi(H)$ for $n$ large enough, almost surely.

On the other hand, for a $\psi$-$m$-approximable sequence $\{X_i\}$ with
$\|d(X_i, X_i^{(m)})\|_\psi = \beta_m$, we have

Proposition 4 Suppose $H''$ and $H'$ are two sequences with $H' < H < H''$ and
there exists a sequence $1 \le m \le n$ such that $n\varphi(H')/(m\log n) \to \infty$,
$\sum_{n=1}^{\infty} n/\psi((H'' - H)/\beta_m) < \infty$ and $\sum_{n=1}^{\infty} n/\psi((H - H')/\beta_m) < \infty$.
Then we have $n\varphi(H')/2 \le k \le 2n\varphi(H'')$ for $n$ large enough, almost surely.

The proofs of these two propositions are very similar to those of Propositions 1
and 2 and are thus omitted.
3 Proofs

Based on the two different representations of the nonparametric estimate in (1) and
(2), we decompose $\|\hat r(x) - r(x)\|$ into the bias term and the variance term,

$$\|\hat r(x) - r(x)\| \le \Big\|\sum_i v_{ni}(r(X_{R_i}) - r(x))\Big\| + \Big\|\sum_i W_{ni}\epsilon_i\Big\|. \tag{4}$$

The bias term is easier to deal with. In fact,

$$\Big\|\sum_i v_{ni}(r(X_{R_i}) - r(x))\Big\| \le 2B\sum_{i=k+1}^{n} v_{ni} + \Big\|\sum_{i=1}^{k} v_{ni}(r(X_{R_i}) - r(x))\Big\| = O(b_n + H^{\alpha}),$$

by Assumptions 1 and 2.

Now we deal with the variance term. Let $\eta_i = W_{ni}\epsilon_i$ and $S_n = \sum_{i=1}^{n} \eta_i$; the
following arguments are conditional on $\{X_i\}$ (in effect treating $W_{ni}$ as
nonrandom weights). Following the idea of Section 6.3 in [19], we write
$\|S_n\| - E\|S_n\| = \sum_{i=1}^{n} e_i$, with $e_i = E[\|S_n\| \,|\, \mathcal{G}_i] - E[\|S_n\| \,|\, \mathcal{G}_{i-1}]$,
where $\mathcal{G}_i$ is the $\sigma$-algebra generated by $\epsilon_1, \ldots, \epsilon_i$ ($\mathcal{G}_0$ is the trivial $\sigma$-algebra).
It is easy to see that $\{e_i\}$ is a real-valued martingale difference sequence, which
potentially enables us to use relevant exponential-type inequalities. However, in
general it does not seem easy to obtain directly appropriate moment bounds
for $e_i$ in order to apply, for example, Lemma 8.9 in [26] (Bernstein's inequality for
martingale differences), and thus we instead work with the quantity

$$d_i = E[\|S_n\| \,|\, \mathcal{G}_i] - E[\|S_n\| \,|\, \mathcal{G}_{i-1}] - E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_i] + E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_{i-1}],$$

where $m$ is the same as in the statement of the theorem and, as discussed in the
remarks following the theorem, needs to be chosen appropriately (as a side note,
$m = 1$ suffices for independent data, in which case we actually have $d_i = e_i$). If
$i + m - 1 > n$, the expression $S_n - \eta_i - \cdots - \eta_{i+m-1}$ is taken to mean $S_n - \eta_i - \cdots - \eta_n$.
Obviously $\{d_i\}$ is still a martingale difference sequence. We denote
$f_i = E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_i] - E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_{i-1}]$, and thus $e_i = d_i + f_i$.

Lemma 2 shows that

$$|d_i| \le \sum_{j=i}^{i+m-1} W_{nj}E(\|\epsilon_j\| \,|\, \mathcal{G}_i) + \sum_{j=i}^{i+m-1} W_{nj}E(\|\epsilon_j\| \,|\, \mathcal{G}_{i-1}), \tag{5}$$

and

$$E(d_i^2 \,|\, \mathcal{G}_{i-1}) \le m\sum_{j=i}^{i+m-1} W_{nj}^2E(\|\epsilon_j\|^2 \,|\, \mathcal{G}_{i-1}). \tag{6}$$

Lemma 3 shows that

$$P\Big(\sum_i f_i > a\Big) \le 2/\psi(a/(2nv_{n1}\gamma_m)). \tag{7}$$

Lemma 4 shows that

$$E\|S_n\| = O(c_{n2} + \sqrt{\gamma_1 v_{n1}}). \tag{8}$$

Aided by these results, we can bound the variance term in three steps.

Step 1: Let $d_i' = d_iI\{|d_i| \le L\}$ for some $L > 0$. We have
$P(\sum_{i=1}^{n} (d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})) > a) \le \exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\} + 1/\psi(\sqrt{x}/(\sqrt{2}\gamma_1 c_{n2}))$,
for all $a > 0$, $x > 0$.

Let $\tilde\psi(x) := \psi(\sqrt{x})$. By Assumption 3, $\tilde\psi$ is convex and increasing and thus
defines an Orlicz norm. Using (6), we have
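$$\sum_{i=1}^{n} E(d_i^2 \,|\, \mathcal{G}_{i-1}) \le 2m^2E(\|\epsilon_1\|^2)\sum_{i=1}^{n} W_{ni}^2 + 2\sum_{i=1}^{n}\sum_{j=i}^{i+m-1} W_{nj}^2E(\|\epsilon_j - \epsilon_j^{(j-i+1)}\|^2 \,|\, \mathcal{G}_{i-1}),$$

where we use that $\epsilon_j^{(j-i+1)}$ is independent of $\mathcal{G}_{i-1}$, and also
use the inequality $\|\epsilon_j^{(j-i+1)} + \epsilon_j - \epsilon_j^{(j-i+1)}\|^2 \le 2\|\epsilon_j^{(j-i+1)}\|^2 + 2\|\epsilon_j - \epsilon_j^{(j-i+1)}\|^2$, which
follows from the parallelogram identity. Furthermore,

$$\begin{aligned}
\Big\|2\sum_{i=1}^{n}\sum_{j=i}^{i+m-1} W_{nj}^2E(\|\epsilon_j - \epsilon_j^{(j-i+1)}\|^2 \,|\, \mathcal{G}_{i-1})\Big\|_{\tilde\psi}
&\le 2\sum_{i=1}^{n}\sum_{j=i}^{i+m-1} W_{nj}^2\big\|E(\|\epsilon_j - \epsilon_j^{(j-i+1)}\|^2 \,|\, \mathcal{G}_{i-1})\big\|_{\tilde\psi} \\
&\le 2\sum_{i=1}^{n}\sum_{j=i}^{i+m-1} W_{nj}^2\big\| \|\epsilon_j - \epsilon_j^{(j-i+1)}\|^2 \big\|_{\tilde\psi} \\
&\le 2\sum_{i=1}^{n}\sum_{j=i}^{i+m-1} W_{nj}^2\|\epsilon_j - \epsilon_j^{(j-i+1)}\|_\psi^2 \\
&\le 2\gamma_1^2c_{n2}^2,
\end{aligned}$$

where we used Lemma 1 (vii) for the second inequality above and Lemma 1 (vi)
for the third inequality above. Then, using Lemma 1 (i), we have for any $x > 0$

$$P\Big(\sum_{i=1}^{n} E(d_i^2 \,|\, \mathcal{G}_{i-1}) > 2m^2E(\|\epsilon_1\|^2)c_{n2}^2 + x\Big) \le 1/\psi(\sqrt{x}/(\sqrt{2}\gamma_1 c_{n2})). \tag{9}$$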

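Using $|d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})| \le 2L$ and
$E[(d_i' - E(d_i' \,|\, \mathcal{G}_{i-1}))^2 \,|\, \mathcal{G}_{i-1}] \le E(d_i'^2 \,|\, \mathcal{G}_{i-1}) \le E(d_i^2 \,|\, \mathcal{G}_{i-1})$,
we get $E(|d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})|^k \,|\, \mathcal{G}_{i-1}) \le (2L)^{k-2}E(d_i^2 \,|\, \mathcal{G}_{i-1})$ for all $k \ge 2$. Since
$d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})$, $i \le n$, is a martingale difference sequence, using Lemma 8.9 in [26]
(Bernstein's inequality for martingales) together with (9), we obtain the desired
bound as follows:

$$\begin{aligned}
&P\Big(\sum_{i=1}^{n} (d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})) > a\Big) \\
&\le P\Big(\sum_{i=1}^{n} (d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})) > a \text{ and } \sum_{i=1}^{n} E(d_i^2 \,|\, \mathcal{G}_{i-1}) \le 2m^2E(\|\epsilon_1\|^2)c_{n2}^2 + x\Big) \\
&\quad + P\Big(\sum_{i=1}^{n} E(d_i^2 \,|\, \mathcal{G}_{i-1}) > 2m^2E(\|\epsilon_1\|^2)c_{n2}^2 + x\Big) \\
&\le \exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\} + 1/\psi(\sqrt{x}/(\sqrt{2}\gamma_1 c_{n2})).
\end{aligned}$$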

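Step 2: Let $d_i'' = d_i - d_i' = d_iI\{|d_i| > L\}$. We have
$P(\sum_i |d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})| > a) \le Cm/(a\psi^{1-1/r}(L/(2Mmv_{n1})))$, where $M = \|\epsilon_i\|_\psi$.

From (5), we have that

$$\|d_i\|_\psi \le 2\sum_{j=i}^{i+m-1} W_{nj}\|\epsilon_j\|_\psi = 2M\sum_{j=i}^{i+m-1} W_{nj},$$

and thus, using Lemma 1 (i) and (v),

$$P(|d_i| > L) \le 1/\psi\Big(L/\Big(2M\sum_{j=i}^{i+m-1} W_{nj}\Big)\Big)$$

and

$$\|d_i\|_r \le C\|d_i\|_\psi \le C\sum_{j=i}^{i+m-1} W_{nj}.$$

Using Hölder's inequality, we have

$$\begin{aligned}
E(|d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})|)
&\le 2E|d_i''| \\
&= 2E(|d_i|I\{|d_i| > L\}) \\
&\le 2(E|d_i|^r)^{1/r}P(|d_i| > L)^{1-1/r} \\
&\le C\sum_{j=i}^{i+m-1} W_{nj} \Big/ \psi^{1-1/r}\Big(L/\Big(2M\sum_{j=i}^{i+m-1} W_{nj}\Big)\Big),
\end{aligned}$$

and thus, using Markov's inequality, we have
$P(\sum_i |d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})| > a) \le E[\sum_i |d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})|]/a \le Cm/(a\psi^{1-1/r}(L/(2Mmv_{n1})))$.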

Step 3: Finally, we demonstrate the bound for the variance term in (4).

Using $E(d_i \,|\, \mathcal{G}_{i-1}) = E(d_i' \,|\, \mathcal{G}_{i-1}) + E(d_i'' \,|\, \mathcal{G}_{i-1}) = 0$, we have that
$d_i = d_i' - E(d_i' \,|\, \mathcal{G}_{i-1}) + d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})$, and then

$$\begin{aligned}
&P(\|S_n\| - E\|S_n\| > 3a) \\
&\le P\Big(\sum_i d_i > 2a\Big) + P\Big(\sum_i f_i > a\Big) \\
&\le P\Big(\sum_i (d_i' - E(d_i' \,|\, \mathcal{G}_{i-1})) > a\Big) + P\Big(\sum_i (d_i'' - E(d_i'' \,|\, \mathcal{G}_{i-1})) > a\Big) + P\Big(\sum_i f_i > a\Big) \\
&\le \exp\{-Ca^2/(aL + m^2c_{n2}^2 + x)\} + 1/\psi(\sqrt{x}/(\sqrt{2}\gamma_1 c_{n2})) \\
&\quad + Cm/(a\psi^{1-1/r}(L/(2Mmv_{n1}))) + 2/\psi(a/(2nv_{n1}\gamma_m)),
\end{aligned}$$

by the previous two steps and (7). The above expression is summable by the
assumption of the theorem, and an application of the Borel-Cantelli lemma leads
to $\|S_n\| - E\|S_n\| = O(a)$. Combining this with (8), the variance term is thus
$\|S_n\| = O(a + c_{n2} + (\gamma_1 v_{n1})^{1/2})$. As noted in Remark 2 following the theorem, the
term $c_{n2}$ can be omitted since we always have $c_{n2} = O(a)$. □

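Lemma 2 Using the notation in the proof of Theorem 1, we have

$$|d_i| \le \sum_{j=i}^{i+m-1} E(\|\eta_j\| \,|\, \mathcal{G}_i) + \sum_{j=i}^{i+m-1} E(\|\eta_j\| \,|\, \mathcal{G}_{i-1}),$$

$$E(d_i^2 \,|\, \mathcal{G}_{i-1}) \le m\sum_{j=i}^{i+m-1} E(\|\eta_j\|^2 \,|\, \mathcal{G}_{i-1}).$$

Proof. Since
$d_i = E(\|S_n\| - \|S_n - \sum_{j=i}^{i+m-1}\eta_j\| \,|\, \mathcal{G}_i) - E(\|S_n\| - \|S_n - \sum_{j=i}^{i+m-1}\eta_j\| \,|\, \mathcal{G}_{i-1})$,
the first inequality is obvious.

Denote $\xi_i = E(\|S_n\| - \|S_n - \sum_{j=i}^{i+m-1}\eta_j\| \,|\, \mathcal{G}_i)$; then $d_i = \xi_i - E(\xi_i \,|\, \mathcal{G}_{i-1})$. Using
the interpretation of $E(\xi_i \,|\, \mathcal{G}_{i-1})$ as the projection of $\xi_i$, we have

$$E(d_i^2 \,|\, \mathcal{G}_{i-1}) \le E(\xi_i^2 \,|\, \mathcal{G}_{i-1}) \le m\sum_{j=i}^{i+m-1} E(\|\eta_j\|^2 \,|\, \mathcal{G}_{i-1}),$$

proving the second inequality. □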
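Lemma 3 For $f_i = E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_i] - E[\|S_n - \eta_i - \cdots - \eta_{i+m-1}\| \,|\, \mathcal{G}_{i-1}]$
as in the proof of Theorem 1, we have

$$P\Big(\sum_i f_i > a\Big) \le 2/\psi(a/(2nv_{n1}\gamma_m)).$$

Proof. By the definition of $\psi$-$m$-approximability for the sequence $\{\epsilon_i\}$, we have that

$$\begin{aligned}
&E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m}^{(m)} + \cdots + W_{nn}\epsilon_n^{(n-i)}\| \,|\, \mathcal{G}_i] \\
&= E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m}^{(m)} + \cdots + W_{nn}\epsilon_n^{(n-i)}\| \,|\, \mathcal{G}_{i-1}].
\end{aligned}$$

Thus $f_i = f_i' - f_i''$, where

$$\begin{aligned}
f_i' &= E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m} + \cdots + W_{nn}\epsilon_n\| \,|\, \mathcal{G}_i] \\
&\quad - E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m}^{(m)} + \cdots + W_{nn}\epsilon_n^{(n-i)}\| \,|\, \mathcal{G}_i], \\
f_i'' &= E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m} + \cdots + W_{nn}\epsilon_n\| \,|\, \mathcal{G}_{i-1}] \\
&\quad - E[\|W_{n1}\epsilon_1 + \cdots + W_{n,i-1}\epsilon_{i-1} + W_{n,i+m}\epsilon_{i+m}^{(m)} + \cdots + W_{nn}\epsilon_n^{(n-i)}\| \,|\, \mathcal{G}_{i-1}].
\end{aligned}$$

Since $|f_i'| \le E(\|W_{n,i+m}\epsilon_{i+m} - W_{n,i+m}\epsilon_{i+m}^{(m)}\| + \cdots + \|W_{nn}\epsilon_n - W_{nn}\epsilon_n^{(n-i)}\| \,|\, \mathcal{G}_i)$, using
Lemma 1 (vii), we have $\|f_i'\|_\psi \le (\max_{1 \le j \le n} W_{nj})\gamma_m$ and thus
$\|\sum_{i=1}^{n} f_i'\|_\psi \le n(\max_{1 \le j \le n} W_{nj})\gamma_m$. Using Lemma 1 (i) we get

$$P\Big(\Big|\sum_i f_i'\Big| > a/2\Big) \le 1/\psi(a/(2nv_{n1}\gamma_m)).$$

By exactly the same arguments,

$$P\Big(\Big|\sum_i f_i''\Big| > a/2\Big) \le 1/\psi(a/(2nv_{n1}\gamma_m)),$$

and the lemma is proved by combining the above two displayed equations. □

Lemma 4 Let $S_n = \sum_{i=1}^{n} \eta_i = \sum_{i=1}^{n} W_{ni}\epsilon_i$ as in the proof of Theorem 1. We have
$E\|S_n\| = O(c_{n2} + \sqrt{\gamma_1 \max_i W_{ni}})$.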
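Proof. We have

$$\begin{aligned}
(E\|S_n\|)^2 &= \Big(E\Big\|\sum_i W_{ni}\epsilon_i\Big\|\Big)^2 \\
&\le E\Big(\Big\|\sum_i W_{ni}\epsilon_i\Big\|^2\Big) \\
&= \sum_i\sum_j W_{ni}W_{nj}E\langle\epsilon_i, \epsilon_j\rangle \\
&= \sum_i W_{ni}^2E\|\epsilon_i\|^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} W_{ni}W_{nj}E\langle\epsilon_i, \epsilon_j\rangle \\
&= \sum_i W_{ni}^2E\|\epsilon_i\|^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} W_{ni}W_{nj}E\langle\epsilon_i, \epsilon_j - \epsilon_j^{(j-i)}\rangle \\
&\le \sum_i W_{ni}^2E\|\epsilon_i\|^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} W_{ni}W_{nj}(E\|\epsilon_i\|^2)^{1/2}(E\|\epsilon_j - \epsilon_j^{(j-i)}\|^2)^{1/2} \\
&\le Cc_{n2}^2 + C\sum_{i=1}^{n}\sum_{j=i+1}^{n} W_{ni}W_{nj}\|\epsilon_i\|_\psi\|\epsilon_j - \epsilon_j^{(j-i)}\|_\psi,
\end{aligned}$$

where we used that $\epsilon_i$ and $\epsilon_j^{(j-i)}$ are independent, and Assumption 3 on $\psi$. Finally,
we see that
$\sum_{i=1}^{n}\sum_{j=i+1}^{n} W_{ni}W_{nj}\|\epsilon_i\|_\psi\|\epsilon_j - \epsilon_j^{(j-i)}\|_\psi \le (\max_j W_{nj})(\sum_i W_{ni})M\sum_{m=1}^{\infty}\|\epsilon_1 - \epsilon_1^{(m)}\|_\psi = O(\gamma_1\max_i W_{ni})$. □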

Proof of Proposition 2. Consider the approximating sequence $(X_1^{(m)}, X_2^{(m)}, \ldots)$.
Define the zero-mean random variables $Y_i^{(m)} = I\{X_i^{(m)} \in B(x, c)\} - p$, where
$c = \varphi^{-1}(2k/n)$ and $p = \varphi(c) = 2k/n$. Divide the sequence $(Y_1^{(m)}, Y_2^{(m)}, \ldots)$ into $m$
groups (we assume $n/m$ is an integer for simplicity of presentation, without loss of
generality) as follows:

group 1: $Y_1^{(m)}, Y_{1+m}^{(m)}, Y_{1+2m}^{(m)}, \ldots, Y_{1+(n/m-1)m}^{(m)}$
group 2: $Y_2^{(m)}, Y_{2+m}^{(m)}, Y_{2+2m}^{(m)}, \ldots, Y_{2+(n/m-1)m}^{(m)}$
...
group m: $Y_m^{(m)}, Y_{2m}^{(m)}, Y_{3m}^{(m)}, \ldots, Y_n^{(m)}$

Because of the construction, the random variables within one group are independent
of each other. Let $Z_i$, $i = 1, \ldots, m$, be the sum of the random variables within the $i$th
group. Using Bernstein's inequality, we have

$$P(|Z_i| > x) \le 2\exp\{-\tfrac{1}{2}x^2/(np/m + x/3)\},$$

and thus

$$\begin{aligned}
P\Big(\sum_{i=1}^{n} I\{X_i^{(m)} \in B(x, c)\} < k\Big)
&\le P\Big(\Big|\sum_{i=1}^{n} (I\{X_i^{(m)} \in B(x, c)\} - p)\Big| > np - k\Big) \\
&= P\Big(\Big|\sum_{i=1}^{m} Z_i\Big| > np - k\Big) \\
&\le mP(|Z_1| > (np - k)/m) \\
&\le 2m\exp\Big\{-\tfrac{1}{2}\Big(\frac{np - k}{m}\Big)^2 \Big/ (np/m + (np - k)/(3m))\Big\} \\
&= 2m\exp\{-(3/14)k/m\}.
\end{aligned} \tag{10}$$

We also have that

$$\begin{aligned}
P(H > h) &\le P\Big(\sum_{i=1}^{n} I\{X_i \in B(x, h)\} < k\Big) \\
&\le P\Big(\sum_{i=1}^{n} I\{X_i^{(m)} \in B(x, c)\} < k\Big) + P(\exists i, \text{ s.t. } X_i^{(m)} \in B(x, c) \text{ and } X_i \notin B(x, h)) \\
&\le P\Big(\sum_{i=1}^{n} I\{X_i^{(m)} \in B(x, c)\} < k\Big) + P(\exists i, \text{ s.t. } d(X_i^{(m)}, X_i) > h - c) \\
&\le P\Big(\sum_{i=1}^{n} I\{X_i^{(m)} \in B(x, c)\} < k\Big) + n/\psi((h - c)/\beta_m) \qquad (11) \\
&\le 2m\exp\{-(3/14)k/m\} + n/\psi((h - c)/\beta_m), \qquad (12)
\end{aligned}$$

where we used Lemma 1 (i) in (11) and used (10) in (12). The proposition follows from
the Borel-Cantelli lemma. □
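The grouping used in the proof above is easy to make explicit. The following sketch, with the hypothetical helper name `block_indices`, lists the indices assigned to each of the $m$ groups under the same simplifying assumption that $m$ divides $n$; indices within a group are $m$ apart, which is what makes the corresponding $Y_i^{(m)}$ independent.

```python
def block_indices(n, m):
    """Split 1..n into m groups; group g holds g, g+m, g+2m, ... (m is assumed to divide n)."""
    if n % m != 0:
        raise ValueError("this sketch assumes m divides n")
    return [list(range(g, n + 1, m)) for g in range(1, m + 1)]
```

For example, `block_indices(12, 3)` gives the groups `[1, 4, 7, 10]`, `[2, 5, 8, 11]` and `[3, 6, 9, 12]`, each of size $n/m$.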